Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition?

We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap.

We create two datasets for semantic amodal segmentation. First, we label 500 images in the BSDS dataset with multiple annotators per image, allowing us to study the statistics of human annotations. We show that the proposed full scene annotation is surprisingly consistent between annotators, including for regions and edges. Second, we annotate 5000 images from COCO. This larger dataset allows us to explore a number of algorithmic ideas for amodal segmentation and depth ordering. We introduce novel metrics for these tasks, and along with our strong baselines, define concrete new challenges for the community.